1.

A computer program is said to learn from experience E with
respect to some task T and some performance measure P if its
performance on T, as measured by P, improves with experience E.
Suppose we feed a learning algorithm a lot of historical weather
data, and have it learn to predict weather. In this setting, what is T?

O The probability of it correctly predicting a future date's 
weather. 

O The weather prediction task. 

O None of these. 

O The process of the algorithm examining a large amount of 
historical weather data. 



2. 
Suppose you are working on weather prediction, and use a
learning algorithm to predict tomorrow's temperature (in 
degrees Centigrade/Fahrenheit). 

Would you treat this as a classification or a regression problem? 
O Regression 
O Classification 



3. 

Suppose you are working on stock market prediction. Typically
tens of millions of shares of Microsoft stock are traded 
(i.e., bought/sold) each day. You would like to predict the 
number of Microsoft shares that will be traded tomorrow. 
Would you treat this as a classification or a regression problem? 
O Classification 
O Regression 



4. 

Some of the problems below are best addressed using a supervised 
learning algorithm, and the others with an unsupervised learning 
algorithm. Which of the following would you apply supervised learning 
to? (Select all that apply.) In each case, assume some appropriate 
dataset is available for your algorithm to learn from. 

□ Given genetic (DNA) data from a person, predict the odds of
him/her developing diabetes over the next 10 years.

□ Examine the statistics of two football teams, and predict
which team will win tomorrow's match (given historical data
of teams' wins/losses to learn from).

□ Take a collection of 1000 essays written on the US Economy,
and find a way to automatically group these essays into a
small number of groups of essays that are somehow
"similar" or "related".

□ Examine a large collection of emails that are known to be
spam email, to discover if there are sub-types of spam mail.



5. 

Which of these is a reasonable definition of machine learning? 

O Machine learning is the science of programming computers. 

O Machine learning learns from labeled data. 

O Machine learning is the field of allowing robots to act 
intelligently. 

O Machine learning is the field of study that gives computers 
the ability to learn without being explicitly programmed. 
Linear Regression with One Variable

5 questions

1.

Consider the problem of predicting how well a student does in her 
second year of college/university, given how well they did in their first 
year. 

Specifically, let x be equal to the number of "A" grades (including A-, A
and A+ grades) that a student receives in their first year of college
(freshman year). We would like to predict the value of y, which we
define as the number of "A" grades they get in their second year
(sophomore year).

Questions 1 through 4 will use the following training set of a small
sample of different students' performances. Here each row is one
training example. Recall that in linear regression, our hypothesis is
$h_\theta(x) = \theta_0 + \theta_1 x$, and we use m to denote the number of training
examples.



For the training set given above, what is the value of m? In the box
below, please enter your answer (which should be a number between 
0 and 10). 


Enter answer here 
2.

Consider the following training set of m = 4 training examples:

x    y
1    0.5
2    1
4    2
0    0

Consider the linear regression model $h_\theta(x) = \theta_0 + \theta_1 x$. What are
the values of $\theta_0$ and $\theta_1$ that you would expect to obtain upon running
gradient descent on this model? (Linear regression will be able to fit
this data perfectly.)


O $\theta_0 = 0, \theta_1 = 0.5$

O $\theta_0 = 1, \theta_1 = 1$

O $\theta_0 = 0.5, \theta_1 = 0.5$

O $\theta_0 = 1, \theta_1 = 0.5$

O $\theta_0 = 0.5, \theta_1 = 0$
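As a rough cross-check of what a converged fit on this training set would look like, the sketch below uses the closed-form least-squares formulas for a single feature rather than gradient descent itself. The function name fit_line is hypothetical and only serves the illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = theta_0 + theta_1 * x."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta_1 = sxy / sxx
    theta_0 = mean_y - theta_1 * mean_x
    return theta_0, theta_1

# Training set from the question above.
xs = [1.0, 2.0, 4.0, 0.0]
ys = [0.5, 1.0, 2.0, 0.0]

theta_0, theta_1 = fit_line(xs, ys)
# Largest absolute residual on the training set; zero means a perfect fit.
max_residual = max(abs(theta_0 + theta_1 * x - y) for x, y in zip(xs, ys))
print(theta_0, theta_1, max_residual)
```

Because the data can be fit perfectly, gradient descent with a suitable learning rate should approach the same parameters that this closed form returns.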



3. Suppose we set $\theta_0 = -1$, $\theta_1 = 0.5$. What is $h_\theta(4)$?


Enter answer here 


4.

Let f be some function so that $f(\theta_0, \theta_1)$ outputs a number. For this
problem, f is some arbitrary/unknown smooth function (not necessarily the
cost function of linear regression, so f may have local optima).
Suppose we use gradient descent to try to minimize $f(\theta_0, \theta_1)$
as a function of $\theta_0$ and $\theta_1$. Which of the
following statements are true? (Check all that apply.)

□ If $\theta_0$ and $\theta_1$ are initialized at a local minimum, then one
iteration will not change their values.

□ Even if the learning rate $\alpha$ is very large, every iteration of
gradient descent will decrease the value of $f(\theta_0, \theta_1)$.

□ If the learning rate is too small, then gradient descent may
take a very long time to converge.

□ If $\theta_0$ and $\theta_1$ are initialized so that $\theta_0 = \theta_1$, then by
symmetry (because we do simultaneous updates to the two
parameters), after one iteration of gradient descent, we will
still have $\theta_0 = \theta_1$.
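The statements above can be probed numerically with a single gradient descent step. The sketch below assumes a hypothetical smooth function $f(\theta_0, \theta_1) = (\theta_0 - 1)^2 + 2(\theta_1 + 2)^2$, chosen only for illustration; the quiz's f is unspecified.

```python
def f(theta_0, theta_1):
    # Hypothetical smooth function with its minimum at (1, -2).
    return (theta_0 - 1.0) ** 2 + 2.0 * (theta_1 + 2.0) ** 2

def grad_f(theta_0, theta_1):
    # Partial derivatives of f with respect to theta_0 and theta_1.
    return 2.0 * (theta_0 - 1.0), 4.0 * (theta_1 + 2.0)

def gradient_step(theta_0, theta_1, alpha):
    # Simultaneous update: both partials are computed from the old values.
    g0, g1 = grad_f(theta_0, theta_1)
    return theta_0 - alpha * g0, theta_1 - alpha * g1

# Starting at the minimum: the gradient is zero, so nothing moves.
print(gradient_step(1.0, -2.0, alpha=0.5))

# A large learning rate can overshoot and increase f.
print(f(0.0, 0.0), f(*gradient_step(0.0, 0.0, alpha=1.0)))

# A small learning rate decreases f, but only a little per step.
print(f(0.0, 0.0), f(*gradient_step(0.0, 0.0, alpha=0.1)))

# Equal starting values need not stay equal after one update.
print(gradient_step(0.0, 0.0, alpha=0.1))
```

The behavior shown is specific to this example function, but it illustrates why the step size matters and why equal parameters carry no symmetry guarantee in general.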



5. 

Suppose that for some linear regression problem (say, predicting
housing prices as in the lecture), we have some training set, and
for our training set we managed to find some $\theta_0$, $\theta_1$ such
that $J(\theta_0, \theta_1) = 0$. Which of the statements below must
then be true? (Check all that apply.)
□ This is not possible: By the definition of $J(\theta_0, \theta_1)$, it is not
possible for there to exist $\theta_0$ and $\theta_1$ so that $J(\theta_0, \theta_1) = 0$.

□ For these values of $\theta_0$ and $\theta_1$ that satisfy $J(\theta_0, \theta_1) = 0$,
we have that $h_\theta(x^{(i)}) = y^{(i)}$ for every training example
$(x^{(i)}, y^{(i)})$.

□ We can perfectly predict the value of y even for new
examples that we have not yet seen
(e.g., we can perfectly predict prices of even new houses that
we have not yet seen).

□ For this to be true, we must have $\theta_0 = 0$ and $\theta_1 = 0$
so that $h_\theta(x) = 0$.
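To make the condition $J(\theta_0, \theta_1) = 0$ concrete, here is a minimal sketch of a squared-error cost. It assumes the common form $J = \frac{1}{2m} \sum_i (h_\theta(x^{(i)}) - y^{(i)})^2$, which this quiz does not spell out, and the training data are hypothetical.

```python
def cost(theta_0, theta_1, xs, ys):
    """Squared-error cost, assuming J = (1 / 2m) * sum of squared residuals."""
    m = len(xs)
    squared_errors = ((theta_0 + theta_1 * x - y) ** 2 for x, y in zip(xs, ys))
    return sum(squared_errors) / (2 * m)

# Hypothetical training set that lies exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

# Every residual is zero for these parameters, so the cost is zero.
print(cost(1.0, 2.0, xs, ys))

# Any parameters that miss a training example give a strictly positive cost.
print(cost(0.0, 0.0, xs, ys))
```

A zero cost only says that the training examples are matched exactly; it says nothing about examples outside the training set.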
1.

Suppose m = 4 students have taken some class, and the class had a
midterm exam and a final exam. You have collected a dataset of their
scores on the two exams, which is as follows:

midterm exam    (midterm exam)^2    final exam
89              7921                96
72              5184                74
94              8836                87
69              4761                78


You'd like to use polynomial regression to predict a student's final
exam score from their midterm exam score. Concretely, suppose you
want to fit a model of the form $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$, where $x_1$
is the midterm score and $x_2$ is (midterm score)^2. Further, you plan to
use both feature scaling (dividing by the "max-min", or range, of a
feature) and mean normalization.

What is the normalized feature ? (Hint: midterm = 89, final = 96 is 
training example 1.) Please round off your answer to two decimal 
places and enter in the text box below. 


Enter answer here 
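The normalization described above (subtract the feature's mean, then divide by its range) can be sketched as follows. The helper name mean_normalize is hypothetical; the scores are the ones from the table.

```python
def mean_normalize(values):
    """Subtract the mean, then divide by the range (max - min)."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    return [(v - mean) / spread for v in values]

# Midterm scores from the table, in training-example order.
midterm = [89.0, 72.0, 94.0, 69.0]
midterm_sq = [v ** 2 for v in midterm]

x1 = mean_normalize(midterm)      # normalized midterm score
x2 = mean_normalize(midterm_sq)   # normalized (midterm score)^2

print([round(v, 2) for v in x1])
print([round(v, 2) for v in x2])
```

Each feature is normalized with its own mean and range, so the squared feature is not simply the square of the normalized midterm score.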
2.

You run gradient descent for 15 iterations with $\alpha = 0.3$ and
compute $J(\theta)$ after each iteration. You find that the value of
$J(\theta)$ decreases slowly and is still decreasing after 15
iterations. Based on this, which of the following conclusions seems
most plausible?

O Rather than use the current value of $\alpha$, it'd be more
promising to try a larger value of $\alpha$ (say $\alpha = 1.0$).

O Rather than use the current value of $\alpha$, it'd be more
promising to try a smaller value of $\alpha$ (say $\alpha = 0.1$).

O $\alpha = 0.3$ is an effective choice of learning rate.



3. 

Suppose you have m = 23 training examples with n = 5 features
(excluding the additional all-ones feature for the intercept term, which
you should add). The normal equation is $\theta = (X^T X)^{-1} X^T y$. For the
given values of m and n, what are the dimensions of $\theta$, X, and y in
this equation?

O X is 23 x 5, y is 23 x 1, $\theta$ is 5 x 1

O X is 23 x 6, y is 23 x 1, $\theta$ is 6 x 1

O X is 23 x 5, y is 23 x 1, $\theta$ is 5 x 5

O X is 23 x 6, y is 23 x 6, $\theta$ is 6 x 6
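One way to see how the shapes in the normal equation fit together is to build the matrices explicitly. The sketch below uses synthetic random data and hypothetical helper names, and it solves the linear system $(X^T X)\theta = X^T y$ by Gaussian elimination instead of forming the inverse, which gives the same $\theta$ when $X^T X$ is invertible.

```python
import random

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a @ x = b for square a and column vector b (list of 1-element
    rows) using Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i][0]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return [[v] for v in x]

random.seed(0)
m, n = 23, 5
features = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(m)]
X = [[1.0] + row for row in features]            # add all-ones column: m x (n + 1)
true_theta = [[float(j)] for j in range(n + 1)]  # hypothetical parameters
y = matmul(X, true_theta)                        # m x 1 targets, no noise

Xt = transpose(X)
theta = solve(matmul(Xt, X), matmul(Xt, y))      # (n + 1) x 1

print((len(X), len(X[0])), (len(y), len(y[0])), (len(theta), len(theta[0])))
```

Since the targets are generated without noise, the recovered theta should match true_theta up to rounding error.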


https://www.coursera.org/learn/machine-learning/exam/7pytE/linear-regression-with-multiple-variables

3/27/2016

Linear Regression with Multiple Variables | Coursera

4. 

Suppose you have a dataset with m = 50 examples and n = 15
features for each example. You want to use multivariate linear
regression to fit the parameters $\theta$ to our data. Should you prefer
gradient descent or the normal equation?

O The normal equation, since it provides an efficient way to
directly find the solution.

O Gradient descent, since $(X^T X)^{-1}$ will be very slow to
compute in the normal equation.

O Gradient descent, since it will always converge to the optimal
$\theta$.

O The normal equation, since gradient descent might be
unable to find the optimal $\theta$.



5. 

Which of the following are reasons for using feature scaling? 

□ It is necessary to prevent gradient descent from getting
stuck in local optima.

□ It prevents the matrix $X^T X$ (used in the normal equation)
from being non-invertible (singular/degenerate).

□ It speeds up solving for $\theta$ using the normal equation.

□ It speeds up gradient descent by making it require fewer
iterations to get to a good solution.
1.

Suppose that you have trained a logistic regression classifier, and it
outputs on a new example x a prediction $h_\theta(x) = 0.7$. This means
(check all that apply):

□ Our estimate for $P(y = 0 | x; \theta)$ is 0.7.

□ Our estimate for $P(y = 1 | x; \theta)$ is 0.3.

□ Our estimate for $P(y = 0 | x; \theta)$ is 0.3.

□ Our estimate for $P(y = 1 | x; \theta)$ is 0.7.


2. 


Suppose you have the following training set, and fit a logistic
regression classifier $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$.

$x_1$    $x_2$    y
1        0.5      0
1        1.5      0
2        1        1
3        1        0

[Figure: plot of the training examples; one axis runs from 0 to 4, the other from 0 to 2]


Which of the following are true? Check all that apply.

□ Adding polynomial features (e.g., instead using
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2)$
) could increase how well we can fit the training data.

□ At the optimal value of $\theta$ (e.g., found by fminunc), we will
have $J(\theta) > 0$.

□ Adding polynomial features (e.g., instead using
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2)$
) would increase $J(\theta)$ because we are now summing over
more terms.

□ If we train gradient descent for enough iterations, for some
examples $x^{(i)}$ in the training set it is possible to obtain
$h_\theta(x^{(i)}) > 1$.
3.

For logistic regression, the gradient is given by
$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$. Which of these is a correct
gradient descent update for logistic regression with a learning rate of
$\alpha$? Check all that apply.

□ $\theta := \theta - \alpha \frac{1}{m} \sum_{i=1}^{m} (\theta^T x - y^{(i)}) x^{(i)}$.

□ $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$

(simultaneously update for all j).



(simultaneously update for all j). 


(simultaneously update for all j). 
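A simultaneous gradient descent update for logistic regression can be sketched as below. It assumes the sigmoid hypothesis and a gradient averaged over the m examples; the toy data, learning rate, and iteration count are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x), where x already includes the intercept feature.
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))

def gradient_step(theta, X, y, alpha):
    """One update of every theta_j, all computed from the old theta."""
    m = len(X)
    errors = [hypothesis(theta, x) - yi for x, yi in zip(X, y)]
    grad = [sum(e * x[j] for e, x in zip(errors, X)) / m
            for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, grad)]

# Hypothetical toy data; the leading 1.0 in each row is the intercept feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

theta = [0.0, 0.0]
for _ in range(200):
    theta = gradient_step(theta, X, y, alpha=0.5)

print(theta)
print([round(hypothesis(theta, x), 3) for x in X])
```

Note that the error term uses the sigmoid output $h_\theta(x^{(i)})$, not the raw linear score $\theta^T x^{(i)}$; using the latter would give the linear regression update instead.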



4. 

Which of the following statements are true? Check all that apply. 

□ The cost function $J(\theta)$ for logistic regression trained with
m > 1 examples is always greater than or equal to zero.

□ For logistic regression, sometimes gradient descent will
converge to a local minimum (and fail to find the global
minimum). This is the reason we prefer more advanced
optimization algorithms such as fminunc (conjugate
gradient/BFGS/L-BFGS/etc).

□ Linear regression always works well for classification if you
classify by using a threshold on the prediction made by
linear regression.

□ The sigmoid function $g(z) = \frac{1}{1 + e^{-z}}$ is never greater than
one (> 1).
5.

Suppose you train a logistic classifier $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$.
Suppose $\theta_0 = -6$, $\theta_1 = 0$, $\theta_2 = 1$. Which of the following figures
represents the decision boundary found by your classifier?

O Figure: [decision boundary plot not recovered]

O Figure: [decision boundary plot not recovered]

O Figure: [decision boundary plot not recovered]

O Figure: [decision boundary plot not recovered]
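The decision boundary implied by a set of parameters can be checked by classifying a few points. This sketch assumes the usual rule of predicting y = 1 when $h_\theta(x) \geq 0.5$, which for the sigmoid is the same as $\theta_0 + \theta_1 x_1 + \theta_2 x_2 \geq 0$; the threshold and the test points are assumptions, not part of the question.

```python
def predict(theta, x1, x2):
    """Predict 1 when theta_0 + theta_1*x1 + theta_2*x2 >= 0, else 0.
    For the sigmoid g, this matches the assumed rule g(z) >= 0.5."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0

# Parameters from the question above.
theta = [-6.0, 0.0, 1.0]

# Hypothetical probe points: x1 varies freely, x2 sits below or above 6.
for x1, x2 in [(0.0, 2.0), (10.0, 2.0), (0.0, 8.0), (10.0, 8.0)]:
    print((x1, x2), predict(theta, x1, x2))
```

With $\theta_1 = 0$ the prediction does not depend on $x_1$ at all, so the boundary is the set of points where $x_2 = 6$.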
Regularization

5 questions

1.

You are training a classification model with logistic 
regression. Which of the following statements are true? Check 
all that apply. 

□ Introducing regularization to the model always results in
equal or better performance on examples not in the training
set.

□ Introducing regularization to the model always results in
equal or better performance on the training set.

□ Adding a new feature to the model always results in equal or
better performance on examples not in the training set.

□ Adding many new features to the model makes it more likely
to overfit the training set.




2. 
Suppose you ran logistic regression twice, once with $\lambda = 0$, and once
with $\lambda = 1$. One of the times, you got parameters
$\theta = \begin{bmatrix} 26.29 \\ 65.41 \end{bmatrix}$, and the other time you got
$\theta = \begin{bmatrix} 2.75 \\ 1.32 \end{bmatrix}$. However, you forgot which value of
$\lambda$ corresponds to which value of $\theta$. Which one do you
think corresponds to $\lambda = 1$?

O $\theta = \begin{bmatrix} 26.29 \\ 65.41 \end{bmatrix}$

O $\theta = \begin{bmatrix} 2.75 \\ 1.32 \end{bmatrix}$



3. 

Which of the following statements about regularization are 
true? Check all that apply. 

□ Because logistic regression outputs values $0 < h_\theta(x) < 1$,
its range of output values can only be "shrunk" slightly by
regularization anyway, so regularization is generally not
helpful for it.

□ Using a very large value of $\lambda$ cannot hurt the performance of
your hypothesis; the only reason we do not set $\lambda$ to be too
large is to avoid numerical problems.

□ Consider a classification problem. Adding regularization may
cause your classifier to incorrectly classify some training
examples (which it had correctly classified when not using
regularization, i.e. when $\lambda = 0$).

□ Using too large a value of $\lambda$ can cause your hypothesis to
overfit the data; this can be avoided by reducing $\lambda$.
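To see how $\lambda$ enters the objective, here is a minimal sketch of a regularized logistic regression cost. It assumes the common form with a cross-entropy term plus $\frac{\lambda}{2m} \sum_{j \geq 1} \theta_j^2$, with the intercept left unpenalized; neither the formula nor the toy data appear in the quiz itself.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_cost(theta, X, y, lam):
    """Cross-entropy cost plus (lam / (2m)) * sum of theta_j^2 for j >= 1.
    Leaving theta[0] unpenalized is an assumed convention."""
    m = len(X)
    total = 0.0
    for x, yi in zip(X, y):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        total += -(yi * math.log(h) + (1 - yi) * math.log(1.0 - h))
    penalty = lam / (2.0 * m) * sum(t * t for t in theta[1:])
    return total / m + penalty

# Hypothetical toy data; the leading 1.0 in each row is the intercept feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

small_theta = [-1.0, 0.5]
large_theta = [-9.0, 4.0]
for theta in (small_theta, large_theta):
    print(theta,
          regularized_cost(theta, X, y, lam=0.0),
          regularized_cost(theta, X, y, lam=1.0))
```

The penalty grows with the squared size of the parameters, so a positive $\lambda$ makes large parameter values more expensive, even when they fit the training set better.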
4.

In which one of the following figures do you think the hypothesis has
overfit the training set?

O Figure:

O Figure:

O Figure:

O Figure:

5.

In which one of the following figures do you think the hypothesis has
underfit the training set?

O Figure:

O Figure:

O Figure:
1.

For which of the following tasks might K-means clustering be a suitable
algorithm? Select all that apply.

□ Given a database of information about your users,
automatically group them into different market segments.

□ Given sales data from a large number of products in a
supermarket, figure out which products tend to form
coherent groups (say are frequently purchased together)
and thus should be put on the same shelf.

□ Given historical weather records, predict the amount of
rainfall tomorrow (this would be a real-valued output).

□ Given sales data from a large number of products in a
supermarket, estimate future sales for each of these
products.




2. 


Suppose we have three cluster centroids $\mu_1 =$ , $\mu_2 =$ , and $\mu_3 =$ .
Furthermore, we have a training example $x^{(i)} =$ . After a cluster
assignment step, what will $c^{(i)}$ be?

O $c^{(i)} = 3$

O $c^{(i)} = 2$

O $c^{(i)} = 1$

O
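The cluster assignment step picks, for each example, the index of the nearest centroid. The sketch below uses hypothetical centroids and a hypothetical example, since the vectors from the question are not available here, and it reports a 1-based index to match the $c^{(i)}$ notation.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign_cluster(x, centroids):
    """Return the 1-based index k of the centroid closest to x."""
    distances = [squared_distance(x, mu) for mu in centroids]
    return distances.index(min(distances)) + 1

# Hypothetical values, not the ones from the question.
centroids = [[0.0, 0.0], [5.0, 5.0], [0.0, 6.0]]  # mu_1, mu_2, mu_3
x = [1.0, 5.0]

print([squared_distance(x, mu) for mu in centroids])
print(assign_cluster(x, centroids))
```

Comparing squared distances is enough, because taking the square root does not change which centroid is closest.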


point 

3. 

K-means is an iterative algorithm, and two of the following steps are 
repeatedly carried out in its inner-loop. Which two? 

□ The cluster assignment step, where the parameters $c^{(i)}$ are
updated.

□ Move the cluster centroids, where the centroids $\mu_k$ are
updated.

□ Feature scaling, to ensure each feature is on a comparable
scale to the others.

□ Using the elbow method to choose K.
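The iterative structure of K-means can be sketched as a loop that alternates between assigning points and moving centroids. This is a minimal illustration under assumed choices (initialization from randomly picked training examples, a fixed iteration count, hypothetical toy data), not a tuned implementation.

```python
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    # Initialize the centroids to k distinct training examples (an assumption).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Cluster assignment step: c[i] = index of the closest centroid.
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Move centroid step: mu_j = mean of the points assigned to cluster j.
        for j in range(k):
            members = [p for p, c in zip(points, assignments) if c == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical toy data with two visibly separate groups.
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
          [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]
assignments, centroids = kmeans(points, k=2)
print(assignments)
print(centroids)
```

Feature scaling and the choice of K happen outside this loop, before the algorithm is run.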



4. 

Suppose you have an unlabeled dataset $\{x^{(1)}, \ldots, x^{(m)}\}$. You run K-means
with 50 different random initializations, and obtain 50 different
clusterings of the data. What is the recommended way for choosing
which one of these 50 clusterings to use?

O The answer is ambiguous, and there is no good way of 
choosing. 
O For each of the clusterings, compute
$\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - \mu_{c^{(i)}} \|^2$, and pick the one that minimizes
this.

O Always pick the final (50th) clustering found, since by that
time it is more likely to have converged to a good solution.

O The only way to do so is if we also have labels $y^{(i)}$ for our
data.
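The quantity $\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - \mu_{c^{(i)}} \|^2$ can be computed directly from a clustering's assignments and centroids. The sketch below compares two hand-built candidate clusterings of hypothetical data; in practice the candidates would come from separate K-means runs.

```python
def distortion(points, assignments, centroids):
    """Mean squared distance from each point to its assigned centroid."""
    total = 0.0
    for p, c in zip(points, assignments):
        total += sum((pi - mi) ** 2 for pi, mi in zip(p, centroids[c]))
    return total / len(points)

# Hypothetical data: four corners of a wide rectangle.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]

# Two hypothetical clusterings, each given as (assignments, centroids).
candidates = [
    ([0, 0, 1, 1], [[0.0, 1.0], [10.0, 1.0]]),  # splits left vs right
    ([0, 1, 0, 1], [[5.0, 0.0], [5.0, 2.0]]),   # splits bottom vs top
]

costs = [distortion(points, a, mu) for a, mu in candidates]
best = min(range(len(candidates)), key=lambda i: costs[i])
print(costs, best)
```

No labels are needed for this comparison, since the cost depends only on the points, their assignments, and the centroids.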



5. 

Which of the following statements are true? Select all that apply. 

□ If we are worried about K-means getting stuck in bad local
optima, one way to ameliorate (reduce) this problem is if we
try using multiple random initializations.

□ The standard way of initializing K-means is setting
$\mu_1 = \cdots = \mu_k$ to be equal to a vector of zeros.

□ Since K-Means is an unsupervised learning algorithm, it
cannot overfit the data, and thus it is always better to have
as large a number of clusters as is computationally feasible.

□ For some datasets, the "right" or "correct" value of K (the
number of clusters) can be ambiguous, and hard even for a
human expert looking carefully at the data to decide.


https://www.coursera.org/learn/machine-learning/exam/4sGmv/unsupervised-learning